Asynchronous Polycyclic Architecture
Abstract
The Asynchronous Polycyclic Architecture (APA) is a new processor design for numerically intensive applications. APA resembles the VLIW architecture in that it provides independent control and concurrent operation of low-level functional units within the processor. The main innovations of APA are the provision for multiple threads of control within each processor, the clustering of functional units into groups that are only weakly coupled with each other, decoupled access/execute, and eager execution. A supercomputer implementing this architecture is currently being designed, using commercially available parts.

* This research was partly supported by the CNPQ grant 501971/91-2.

1 — Introduction and motivation

Development of the Asynchronous Polycyclic Architecture (APA) concept was spurred by the needs of Project Omicron, an academic research effort. The project's main goal (expected to be achieved sometime in 1994) is to design and build a supercomputer for numerical applications, with real-world performance in the same range as the then-current supercomputers. APA processors are expected to provide better sustained performance on real-world problems than standard vector processors with the same peak capacity. Since the constraints of a typical university research budget excluded expensive solutions like custom ECL circuits or exotic packaging and cooling, we were forced to develop a novel architecture that is better matched to the needs of real-world computations than the standard vector processor design, but is still realizable with commercially available components.

We began our design effort with a critical evaluation of existing architectures. Current vector processors are machines whose architecture was designed in the early seventies. Since then, much research effort has gone into computer architecture, and many important results have been achieved, but they have had little influence on supercomputer architectures. A basic assumption was that there should be ways to use these results in high-performance machines, provided there is no need for compatibility with older architectures.

Such an evaluation required a performance yardstick. Given our emphasis on sustained speed on real-world problems, we decided to compare the alternatives by their average performance on a representative benchmark, rather than by their theoretical peak performance under ideal conditions. As pointed out in [8], peak performance adds only to the price of a machine. Since numerical algorithms typically spend most of their time in their innermost loops, we chose a standard collection of simple loops, the Lawrence Livermore Kernels (LLK) [9], as our benchmark. The performance of many supercomputers on these loops is well known, and the loops are simple enough to be compiled and simulated by hand for all the alternative designs that we had to consider.

It is instructive to note that the speed actually observed on these loops is a very small fraction of the maximum speed. This evidence strongly suggests that the traditional vector processor architecture is somewhat ill-adapted to the very tasks for which it was designed. The explanation for this paradox is quite simple: the vector architecture was designed with individual operations in mind, while the real problem is to process entire inner loops, not individual operations. Loops contain recurrences and conditionals, situations that are not considered in the design of vector architectures.
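To make the problem concrete, here is a minimal C sketch (an illustration of ours, in the spirit of the LLK-style kernels rather than code taken from the paper) of two inner loops that fall outside the simple vector forms: the first carries a true recurrence, the second a data-dependent conditional.

    /* Illustrative only: loops of this shape are common in numerical code
       but do not map onto a single chained vector operation. */
    void recurrence(double *x, const double *y, int n)
    {
        for (int k = 1; k < n; k++)
            x[k] = x[k - 1] + y[k];   /* x[k] depends on x[k-1] */
    }

    void conditional(double *x, const double *y, int n)
    {
        for (int k = 0; k < n; k++)
            if (y[k] > 0.0)           /* control flow inside the loop body */
                x[k] = x[k] / y[k];
    }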
In vector architectures, the main source of inefficiency is the lack or inadequacy of support for operations involving recurrences or conditionals. Following up on the RISC analogy, it seems reasonable to assume that, by breaking up the special instructions of standard vector processors into simpler ones, one would obtain substantially better overall performance from the same amount of hardware. What is needed is an architecture that can offer reasonable performance on general loops, containing recurrences and conditional tests, instead of one that concentrates on optimizing only a few restricted vector forms.

Our proposed Asynchronous Polycyclic Architecture implements this principle by combining the basics of the VLIW architecture [6] with the decoupled access/execute concept [13], and with extensions for loop execution, for conditional execution of instructions, for fetching large amounts of data, and for autonomous operation of groups of functional units, plus eager execution for hiding memory latency. (In eager execution, an instruction is executed as soon as its operands are available and a unit is free to perform the operation, even when it is not yet certain that the control flow will require the instruction; in lazy execution, an instruction is executed only when it is reached by the control flow.)

Sections 2 and 3 describe the generic APA concept in more detail and provide arguments in support of these claims. Section 4 briefly outlines an interconnection network specially designed for connecting the processing units to the memory subsystem. Section 5 describes one specific instance of the architecture, the preliminary design of the Omicron supercomputer. Section 6 offers some concluding remarks.

2 — Description of the APA

The Asynchronous Polycyclic Architecture resulted from a critical analysis of the characteristics of the VLIW architecture. The detailed evolution leading to the APA can be found in [2]; it is only summarized here. A VLIW processor [6] is conceptually characterized by: a single thread of execution; a large number of data paths and functional units, with control planned at compile time; instructions providing enough bits to control the action of every functional unit directly and independently in each cycle; operations that require a small and predictable number of cycles to execute; and pipelined operation, i.e., each functional unit can initiate a new operation in each cycle. On examining this conceptual characterization, it becomes clear that the rationale is to have a large degree of parallelism and a simple, and therefore fast, control cycle. Closer scrutiny of the conditions above shows that they are sufficient for the goal, but most are not necessary, at least not to the full extent stated.

The evolution took place in four steps. First, the access/execute concept [13] was introduced. The motivation came from practical considerations about memory access. The ideal memory subsystem for any supercomputer should have large capacity, very low access times, and very high bandwidth, at a low cost. Real memories must compromise on some of these goals. If a large capacity is required, cost and size limitations almost force the use of relatively slow dynamic memories. Moreover, numerically intensive applications require a very high memory bandwidth. General-purpose architectures rely almost invariably on caches to speed up execution, exploiting the locality of memory references.
Although this is the case with general computing loads, this assumption can be wildly wrong for numerically intensive programs, since dealing with large arrays is incompatible with any realistically sized cache. This conclusion has been reported for quite different machines [1, 5, 12]. The way to go is extensive interleaving of the memory; the latency may be high, but as long as it is possible to maintain a large number of outstanding requests, operands can be obtained at the necessary rate. It is therefore infeasible to have a predictable access time for the functional units in charge of memory accesses, due to the static unpredictability of memory bank conflicts. This can be circumvented by considering that the processor remains synchronous in virtual time, stalling if data is not available when expected, but this may have a substantially adverse effect on performance.

The first new APA feature solves this problem by decoupling the process of memory access. Two kinds of functional units are used: the first, called the address unit, generates and sends the required addresses to the memory subsystem; the second, called the data reference unit, is responsible for reordering the data words coming from memory and for sending them, upon request, to the other functional units. This is an asynchronous counterpart to the decoupled access/execute concept introduced in [13]. Address units may operate in two modes: single address and multiple address. In single-address mode, the unit's role is only to receive an address calculated by an arithmetic unit and to send it to the memory subsystem; in multiple-address mode, its role is to autonomously generate the values of a set of arithmetic progressions until a specified number of elements has been generated; this mode is used for references to (a set of) arrays. Once started, the address unit proceeds asynchronously with respect to the main flow of control: it generates as many addresses as the memory subsystem can accept, waits if the memory subsystem cannot accept new requests, resumes when that condition clears, and stops when all addresses have been generated. As a consequence, the memory subsystem operates at full capacity and the main flow of control is not disturbed by saturation of the memory subsystem. On the hardware side, fewer bits are required in the instruction, since these units are controlled by their own instructions, and fewer ports are required in the central register file, since the address units can use a local copy of the required registers.

The second step was extending this concept of asynchronous operation to the other functional units, each with its own flow of control. Experience in programming such a machine shows that this is unduly complicated: very few situations require the full splitting of the functional units. A hierarchical system is adequate for almost all situations: the functional units can be divided into groups, each composed of a certain number of arithmetic units and each capable of forking the operation of address units.

The third step resulted from the observation that communication between groups is infrequent. An important consequence is that there is no need to share a central register file among groups; each group can use its own, provided it is able to send values to the other groups. This communication is done over a private bus connecting the functional units together.
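As a behavioural sketch (an illustration of ours, not the Omicron hardware specification; the structure fields and the mem_try_enqueue hook are assumptions), an address unit in multiple-address mode walks an arithmetic progression and pushes each address to the memory subsystem, stalling whenever the subsystem cannot accept more requests:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t base;    /* first address of the progression    */
        uint64_t stride;  /* constant increment between accesses */
        uint64_t count;   /* number of addresses still to issue  */
    } addr_stream;

    /* Hypothetical memory-subsystem hook: returns false when its request
       queue is full, so the address unit must wait until it drains. */
    bool mem_try_enqueue(uint64_t address);

    void address_unit_run(addr_stream *s)
    {
        uint64_t a = s->base;
        while (s->count > 0) {
            if (mem_try_enqueue(a)) {   /* accepted: advance the progression */
                a += s->stride;
                s->count--;
            }
            /* otherwise stall; the main flow of control is unaffected */
        }
    }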
Three additional features are required for efficient use of the above characteristics: eager execution, delayed interrupts, and a tagged memory system. To keep the functional units busy, values should be calculated as soon as possible, regardless of the possibility of the control flow rendering them unnecessary. A consequence is that abnormal conditions (like division by zero, references to invalid addresses, etc.) can arise. To postpone the resulting interrupts, every word is composed of a tag field and a value field, both in the central register file and in memory. When an operation encounters an error condition, a number associated with this error condition is placed in the tag, and the address of the offending instruction is placed in the value field of the result word. Any operation performed on a word containing a non-zero tag has that original word as its result, thus preserving both the nature of the error and the address of the offending instruction. If both operands have non-zero tags, one is arbitrarily chosen as the result. Real interrupts occur only when the word's value is effectively used. A value is effectively used when the computation cannot proceed without it. This may be a somewhat elusive concept, and a precise characterization is beyond the scope of this paper. In a simple approach, a value is effectively used when it is used as a final result, in output operations; in a more restrictive approach, when it is used in an unavoidable operation, as determined by the execution flow graph. The hardware must have special instructions that generate an interrupt when a value is effectively used. It is up to the compiler to determine when this is the case.

Efficient execution of loops is obtained by a variant of polycyclic support; since its basics are described elsewhere [4, 12], it will not be described in detail here. The main difference from the implementation used in the Cydra 5, described in the above references, is that there are no explicit predicates; predicates are implied by the "age" of the iteration. This allows the hardware to support automatic prolog and epilog generation. The hardware also supports predicate-controlled execution of instructions.

Figure 1: Evolution from VLIW to APA architecture. In this drawing, dotted lines represent flow of control, AL stands for Arithmetic Logic Unit, and AU for Address Unit. In A, the traditional VLIW architecture. In B, evolution to the access/execute architecture. In C, the introduction of groups. Experimental analysis of programs for this architecture shows that each autonomous set may have its own copy of the register file, as shown in D. Data is exchanged via an internal bus. Other essential features for making this possible are not shown.

The result of these characteristics is a processor with higher performance than an equivalent VLIW, implemented with register files that have fewer ports and with a dynamic instruction word size: each group is small, and at any moment only the active groups must fetch instructions. For instance, the machine resulting from the dimensioning studied in the next section uses register files with only two write ports, and instructions of only 80 bits. Figure 1 depicts the evolution from the VLIW to the APA architecture.
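The tag-propagation rule described above can be pictured with a short C sketch (an illustration of ours; the exact word layout and error codes are assumptions, not the Omicron encoding):

    #include <stdint.h>

    typedef struct {
        uint32_t tag;     /* 0 = valid datum, otherwise an error code        */
        uint64_t value;   /* the datum, or the faulting instruction's
                             address when tag != 0                           */
    } word_t;

    /* Tagged operands propagate unchanged, so the error (and the address of
       the instruction that caused it) survives until the value is
       effectively used; only then is an interrupt raised. */
    word_t apa_add(word_t a, word_t b)
    {
        if (a.tag != 0) return a;            /* propagate the tagged operand    */
        if (b.tag != 0) return b;            /* arbitrary choice if both tagged */
        return (word_t){ .tag = 0, .value = a.value + b.value };
    }

    word_t apa_div(word_t a, word_t b, uint64_t pc)
    {
        if (a.tag != 0) return a;
        if (b.tag != 0) return b;
        if (b.value == 0)                    /* error: defer, do not trap now   */
            return (word_t){ .tag = 1 /* assumed div-by-zero code */, .value = pc };
        return (word_t){ .tag = 0, .value = a.value / b.value };
    }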
In short, the architecture of one APA group is characterized by:

1. a large number of data paths and functional units, with control planned at compile time;
2. functional units divided into groups, each group having its own register file and its own flow of control;
3. instructions providing enough bits to control the action of every functional unit directly and independently in each cycle;
4. operations that require a small and predictable number of cycles (in virtual time) to execute;
5. pipelined operation, i.e., each functional unit can initiate a new operation in each cycle;
6. decoupled memory accesses;
7. hardware support for loop execution;
8. eager execution and delayed interrupts.

An APA processor is composed of a set of groups sharing a common memory.

3 — Exploitation of parallelism

As defined above, a group is a unit of execution. It can be considered the dual of a pipe in vector machines. In the same way that a vector processor can have several pipes, an APA processor may have several groups. These groups share a common view of memory and must be interconnected to a common memory subsystem. Here again, the problem is not one of latency but one of bandwidth. Section 4 describes an interconnection network, called the Omicron Network, specially designed for this application.

With a processor having several groups, fine-grain parallelism is exploited within groups, and medium-grain parallelism, if present in loops, is exploited within a processor. Coarse-grain parallelism can be exploited by several processors; in this regard, the APA is no different from other architectures. Exploitation of medium-grain parallelism, which is a distinctive characteristic of the APA, is done on blocks contained in innermost loops. We call an inner block all the code contained inside an inner loop; it is not necessarily a basic block, since it may contain conditionals; if it contains procedure calls, however, the inner-loop condition allows only calls to procedures that do not contain loops. The following considerations are based on simple compiler technology; the use of program transformations and other techniques already developed for vector machines will lead to a considerable increase in performance. Every inner block falls into one of the following categories.

3.1 — Blocks without data or control dependencies

These blocks can be divided among a sufficient number of groups to allow the use of all memory bandwidth and all functional units. This category also includes a few particular, easily resolved cases of data dependency, like the sum of products. As an example, the loop
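As a hedged sketch of ours (the group count NGROUPS and the interleaved split are illustrative assumptions, not the paper's code), a sum-of-products reduction of this kind can be divided into independent partial sums, one per group, followed by a short combine step:

    #define NGROUPS 4   /* assumed number of groups, purely for illustration */

    double dot_product(const double *x, const double *y, int n)
    {
        double partial[NGROUPS] = { 0.0 };

        /* each "group" g independently handles elements g, g+NGROUPS, ... */
        for (int g = 0; g < NGROUPS; g++)
            for (int k = g; k < n; k += NGROUPS)
                partial[g] += x[k] * y[k];

        double sum = 0.0;                    /* cheap final combine step */
        for (int g = 0; g < NGROUPS; g++)
            sum += partial[g];
        return sum;
    }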